Add persistent program cache for Program.compile #1912
cpcloud wants to merge 3 commits into NVIDIA:main from
Conversation
Thanks, Phillip! I have this PR in my review backlog 🙏 The most important question: Are these cache implementations multithreading/multiprocessing safe? This is the key challenge that real-world apps will stress test. In CuPy, our on-disk cache has been stress-tested in DOE supercomputers.
Addressed in ff886d3585 (fixes) and cad93d0 (refactor + star-import note). High: source-directory include. Medium: over-eviction race. Low: star-import. Added a note in
@leofang -- yes, all three backends are designed and tested for concurrent access, with different scopes:
Cross-process coverage in
One concurrency bug this review shook out (over-eviction after a suppressed
FWIW, I briefly explored "safe pickle" and "signed pickle blobs" in this chat: The conclusion there is:
leofang
left a comment
Thanks, Phillip! Sorry for the long wait. Sending out the first wave of my review. Will continue asap.
Note: It would be nice if we could break up the two largest files (cuda_core/cuda/core/utils/_program_cache.py and cuda_core/tests/test_program_cache.py, each 1-2k lines) into smaller logical units.
# ``name_expressions`` is incompatible with the cache: NVRTC
# populates ``ObjectCode.symbol_mapping`` from name-expression
# mangling at compile time, and that mapping isn't carried in
# the binary bytes the cache stores. Without this guard the
# first call (cache miss) would return an ObjectCode with
# symbol_mapping populated, while every subsequent call (hit)
# would return one without -- silently breaking later
# ``get_kernel(name_expression)`` lookups that work on the
# uncached path. Fail loud here instead.
if name_expressions:
    raise ValueError(
        "Program.compile(cache=...) does not support name_expressions: "
        "ObjectCode.symbol_mapping is populated by NVRTC at compile "
        "time and is not preserved across a cache round-trip, so cache "
        "hits would silently break get_kernel(name_expression) lookups "
        "that the uncached path supports. Compile without cache= when "
        "name_expressions are needed, or look up mangled symbols by "
        "hand from the cached ObjectCode."
    )
Note to self: I need to address this after 1.0 is out, xref: cupy/cupy#9801
def __getitem__(self, key: object) -> bytes:
    k = _as_key_bytes(key)
    with self._lock:
        try:
            data, _size = self._entries[k]
        except KeyError:
            raise KeyError(key) from None
        # Touch LRU: a real read promotes the entry to "most recent"
        # so eviction prefers genuinely cold entries.
        self._entries.move_to_end(k)
        return data
Q: What would be our recommended way of using InMemoryProgramCache in a multi-GPU env? Wondering about this because we usually have each GPU driven by a thread, and if the intended use case is a global cache object (which makes sense on a homogeneous system like DGX) this would cause serialization.
In CuPy internally there is a per-device cache so this issue is avoided.
@leofang -- on the multi-GPU question. Two options worth weighing, both viable.
Option A -- document the dict-of-caches pattern, no API change:
caches = {d.device_id: InMemoryProgramCache() for d in devices}
# per thread:
program.compile("cubin", cache=caches[Device().device_id])
Option B -- ship a `PerDeviceProgramCache` routing wrapper.
I lean A for the first cut: the common case (homogeneous DGX, single SKU) is correctly served by one shared cache and benefits from cross-device amortisation, the heterogeneous-arch case is a 3-line dict away, and B can ship post-1.0 if real workloads hit the pattern often enough to deserve a class. Happy to pivot to B if you'd rather have it on the public surface from day one.
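For the record, a minimal runnable sketch of what Option B could look like. Everything here is hypothetical: `PerDeviceProgramCache`, the zero-arg `factory` callable, and the `current_device` hook (standing in for `Device().device_id`) are illustrative names, not the PR's API.

```python
class PerDeviceProgramCache:
    """Sketch: route each read/write to one backend cache per device id."""

    def __init__(self, factory, current_device):
        self._factory = factory                # builds one backend cache, e.g. InMemoryProgramCache
        self._current_device = current_device  # returns the active device id
        self._caches = {}

    def _cache(self):
        # One backend cache per device, created lazily on first use.
        dev = self._current_device()
        if dev not in self._caches:
            self._caches[dev] = self._factory()
        return self._caches[dev]

    def get(self, key, default=None):
        return self._cache().get(key, default)

    def __setitem__(self, key, value):
        self._cache()[key] = value
```

With a plain `dict` as the stand-in backend, entries written on device 0 are invisible on device 1, which is exactly the per-device isolation CuPy's internal cache provides.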
Add a bytes-in / bytes-out cache abstraction and two backends for
caching compiled CUDA programs across process boundaries.
* ``ProgramCacheResource`` -- abstract base. Concrete backends store
raw binary bytes keyed by ``bytes`` or ``str``; reads return the
same payload. ``__setitem__`` accepts ``bytes``, ``bytearray``,
``memoryview``, or any :class:`~cuda.core.ObjectCode` (path-backed
too -- the file is read at write time so the cached entry holds
the binary content, not a path that could move). Provides default
``get``, ``update`` (mapping or pairs), ``close``, and context
manager. ``__contains__`` is intentionally NOT abstract: the racy
``if key in cache; data = cache[key]`` idiom is steered toward
``cache.get(key)`` instead.
* ``InMemoryProgramCache`` -- single-process LRU on
``collections.OrderedDict`` with ``threading.RLock`` and a
size-only cap. Reads promote via ``move_to_end``.
* ``FileStreamProgramCache`` -- directory of atomic per-entry files.
Writes stage to ``tmp/`` then ``os.replace`` into
``entries/<2-char>/<blake2b-hex>``; concurrent readers never see a
torn file. Each entry is the raw compiled binary (no pickle, no
framing) so files are directly consumable by external NVIDIA
tools (``cuobjdump``, ``nvdisasm``, ``cuda-gdb``). Eviction is
true LRU via ``st_atime`` (the read path calls ``os.utime`` to
bypass ``relatime`` / ``NtfsDisableLastAccessUpdate`` /
``noatime``). Stat-guarded prunes refuse to unlink entries
another process replaced mid-eviction. ``tmp/`` is recreated on
every write so an external wipe doesn't crash later writes.
Default cache directory comes from
``platformdirs.user_cache_path("cuda-python", appauthor=False,
opinion=False) / "program-cache"``.
* Windows sharing-violation handling -- ``os.replace``,
``path.stat() + read_bytes()``, and ``path.unlink`` all retry on
winerror 5/32/33 with a bounded backoff (~185 ms). The
``_is_windows_sharing_violation`` predicate filters EACCES only
when ``winerror`` is absent so non-sharing winerrors propagate as
the real config errors they are. Off-Windows ``PermissionError``
always propagates.
* ``make_program_cache_key`` -- escape hatch for callers whose
compile inputs require an ``extra_digest`` (header / PCH content
fingerprints, NVVM libdevice). Builds a 32-byte blake2b digest
via a backend-strategy pattern: a ``_KeyBackend`` ABC with
per-code-type subclasses (``_NvrtcBackend``, ``_LinkerBackend``,
``_NvvmBackend``) owns each backend's validation, code coercion,
option fingerprinting, name-expression handling, version probe,
and extra-payload hashing. The orchestrator dispatches via
``_BACKENDS_BY_CODE_TYPE[code_type]`` and assembles the digest in
fixed order. Backend gates match ``Program.compile``: rejects
inputs the real compile would reject (side-effect options,
external-content options without an ``extra_digest``,
driver-linker-unsupported options, NVRTC ``options.name`` with a
directory component). NVVM ``extra_sources`` is hashed in
caller-provided order because NVVM module linking is
order-dependent in the general case (overlapping symbols, weak
definitions); canonicalising would silently change behavior for
order-dependent inputs.
Adds ``platformdirs >=3.0`` to ``cuda_core/pyproject.toml`` and the
matching pixi manifests.
Tests cover the abstract contract, key-construction matrix
(deterministic, supported-target gates, backend-probe taint, gate
canonicalization, side-effect / external-content / dir-component
guards, schema version mixing), single-process CRUD and LRU,
atomic-write race coverage, atime LRU promotion, stat-guarded
prune / atime touch / clear / size-cap, default-dir resolution via
platformdirs, the ``_is_windows_sharing_violation`` predicate's
truth table including the regression case (non-sharing winerror
plus EACCES propagates), tmp-dir recreation after external wipe,
multiprocess concurrent writers / reader-vs-writer torn-file
safety / size-cap eviction race.
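The stage-then-publish write scheme the ``FileStreamProgramCache`` bullet describes can be sketched standalone. This is an illustration under stated assumptions, not the PR's code: `put_entry` is a hypothetical helper, and error cleanup / retries are omitted.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def put_entry(root: Path, key: bytes, payload: bytes) -> Path:
    """Stage payload in tmp/, then atomically publish it under
    entries/<2-char>/<blake2b-hex> via os.replace (sketch)."""
    digest = hashlib.blake2b(key, digest_size=32).hexdigest()
    target = root / "entries" / digest[:2] / digest
    target.parent.mkdir(parents=True, exist_ok=True)
    tmp_dir = root / "tmp"
    tmp_dir.mkdir(parents=True, exist_ok=True)  # recreated on every write
    fd, tmp_name = tempfile.mkstemp(dir=tmp_dir)
    with os.fdopen(fd, "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_name, target)  # atomic: readers never see a torn file
    return target
```

Because `os.replace` is atomic on the same filesystem, a concurrent reader sees either the old entry or the complete new one, never a partial write; the entry file itself is the raw binary, so `cuobjdump` and friends can open it directly.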
Adds a ``cache=`` keyword to :meth:`cuda.core.Program.compile` that threads the persistent cache machinery into the high-level compile path. With ``cache=None`` (the default) the call is byte-identical to the uncached path -- no key derivation, no extra import, no behavior change. When a cache is provided, the wrapper derives a key via :func:`~cuda.core.utils.make_program_cache_key` from the program's source, options, and target type; checks the cache; on hit, returns a fresh ``ObjectCode._init(hit_bytes, target_type, name=self._options.name)``; on miss, runs the underlying compile and stores ``cache[key] = compiled`` (the cache extracts ``bytes(obj.code)``).
Two compile-time guards close obvious footguns:
* ``name_expressions`` plus ``cache=`` raises ``ValueError``. NVRTC populates ``ObjectCode.symbol_mapping`` from name-expression mangling at compile time, and that mapping isn't carried in the binary the cache stores. Without this guard the first call (miss) would return an ObjectCode with mappings populated, while every subsequent call (hit) would return one without -- silently breaking later ``get_kernel(name_expression)`` lookups that work on the uncached path. Compiles that need name_expressions should run without ``cache=``, or look up mangled symbols by hand from the cached ``ObjectCode``.
* Inputs whose compilation effect isn't captured by the key (``include_path``, ``pre_include``, ``pch``, ``use_pch``, ``pch_dir``, NVVM ``use_libdevice=True``, NVRTC ``options.name`` with a directory component, side-effect options like ``create_pch`` / ``time`` / ``fdevice_time_trace``) propagate the ``ValueError`` from ``make_program_cache_key`` -- those callers should use ``make_program_cache_key`` directly with an ``extra_digest`` covering the external content.
Cache hits also mirror the uncached path's NVRTC-PTX loadability warning: when ``self._backend == "NVRTC"``, ``target_type == "ptx"``, and ``_can_load_generated_ptx()`` returns False, a ``RuntimeWarning`` is emitted before returning the cached bytes. Loadability is a property of the active driver, not of how the bytes were produced, so the warning applies equally to cached PTX.
Supporting refactors:
* Unify ``Program``'s source retention into a single ``_code`` field (was split between ``_code`` for NVVM and a separate ``_source`` for c++/ptx). ``_code`` is now always bytes; the cache wrapper decodes back to ``str`` for c++/ptx before passing to ``make_program_cache_key`` (which only accepts bytes for NVVM).
* Move the actual compile call into a module-level ``_program_compile_uncached`` so tests can monkeypatch the seam without going through NVRTC. ``Program`` is a ``cdef class``, so its methods cannot be reassigned from Python -- the seam has to live outside the class.
* The unified ``_code`` field also exposed a pre-existing bug on the NVVM path: the C pointer was being recomputed from the caller's original ``code`` argument rather than from ``self._code``, which crashed for ``bytearray`` inputs that the field's bytes coercion handled cleanly. Fixed; regression test added in ``test_program.py``.
Tests in ``test_program_compile_cache.py`` cover both halves of the contract: the wrapper-level miss/hit/error paths against a recording stub (verifying it's duck-typed and doesn't require subclassing ``ProgramCacheResource``), the rejection paths (name_expressions, extra_digest-required options, side-effect options, NVRTC ``options.name`` with a directory component), the PTX loadability warning on cache hit (positive: warns when the driver can't load the cached PTX; negative: stays quiet otherwise), and a real NVRTC end-to-end roundtrip using ``FileStreamProgramCache`` across reopen so the bytes match across processes.
Pushed d177450 addressing the first review wave: platformdirs dropped (
test_program_cache.py file split deferred. The 2179-line file is internally well-organised by section (key construction / InMemory / FileStream / multi-process), and after the source split the test reorganisation is mechanical churn rather than a clarity win. Happy to do it as a follow-up if you'd prefer. Multi-GPU question replied to in #issuecomment-4350987016.
Drop the module-level __getattr__/__dir__ shim that lazily exposed the program-cache classes. Import them eagerly alongside the memoryview helpers. Remove the two tests that pinned the lazy-import behaviour. Also pick up ruff's auto-fixes (import-block blank lines, long-line reformat) and rename the unused classmethod argument to satisfy ARG005.
Program caches
--------------
Should we merge this section with CUDA compilation toolchain above, or at least move it and make them stay closer?
Alternatively, just merging with the above section is fine too.
    self[key] = value

def close(self) -> None:  # noqa: B027
    """Release backend resources. No-op by default."""
This is a no-op only for trivial backends. If users don't use it as a context manager (well... why they should is a separate discussion), then they had better call .close() for consistency and portability (to another backend).
    obj = program.compile("cubin")
    cache[key] = obj  # extracts bytes(obj.code)
else:
    obj = ObjectCode._init(data, "cubin")
ObjectCode has many public constructors. We must not teach about any private methods.
Intentionally does NOT subclass ``ProgramCacheResource`` -- the wrapper
should be duck-typed, so we test the duck-typed surface directly.
Why don't we require each cache= instance to subclass from ProgramCacheResource?
cpdef bint _can_load_generated_ptx() except? -1:
    """Check if the driver can load PTX generated by the current NVRTC version."""
def _can_load_generated_ptx():
cpdef functions are also accessible from Python; is this change needed?
leofang
left a comment
Review of the persistent program cache implementation. Findings are categorized inline as:
- Critical (1): Must fix before merging — cache-write failure drops a successfully compiled ObjectCode.
- Consideration (8): Performance/functionality concerns worth discussing; can be deferred but should be tracked.
- Nitpick (6): Not blockers for merging.
This excludes items already captured in Leo's and rwgk's earlier review comments (platformdirs removal, over-eviction race, source-directory include guard, ObjectCode._init in docstrings, cpdef→def change, duck-typed test, close() vs context manager, multi-GPU usage, star-import laziness, doc section placement, SQLiteProgramCache removal).
Overall this is a well-engineered piece of work — the TOCTOU handling, stat-guards, and atomic-write design are thorough and well-documented. The main concern is the cache-write failure path losing the compile result.
# ``self._code`` is always stored as bytes (see ``Program_init``),
# but ``make_program_cache_key`` only accepts bytes when
# ``code_type == "nvvm"`` -- c++/ptx must be ``str``. Decode back
# to the original str for the NVRTC/linker paths so the generated
# key matches keys callers build by passing the str source
# directly.
code_for_key = self._code if self._code_type == "nvvm" else self._code.decode("utf-8")

key = make_program_cache_key(
    code=code_for_key,
    code_type=self._code_type,
    options=self._options,
    target_type=target_type,
)
hit_bytes = cache.get(key)
if hit_bytes is not None:
    # The uncached NVRTC path warns when the active driver can't
    # load freshly-generated PTX; that loadability is a property
    # of the driver, not of how the bytes were produced, so the
Consideration: logs silently receives nothing on cache hit.
When the cache hits, no compilation occurs, so logs.write() is never called. A caller relying on logs to confirm compilation ran (or to capture PTX output) will get silence with no indication it was a cache hit vs. a compile that produced no output. Worth a note in the cache= or logs parameter docstring, e.g.:
On a cache hit, no compilation is performed and
logs receives no output.
        name for name in _DRIVER_LINKER_UNSUPPORTED_FIELDS if getattr(options, name, None) is not None
    ]
    if unsupported:
        raise ValueError(
            f"the cuLink driver linker does not support these options: "
            f"{', '.join(unsupported)}; Program.compile() would reject this "
            f"configuration before producing an ObjectCode."
        )

    def option_fingerprint(self, options, target_type):  # noqa: ARG002
        # For PTX inputs the Linker reads only a subset of ProgramOptions
        # (see ``_translate_program_options`` in _program.pyx); fingerprint
        # just those fields so shared ProgramOptions carrying NVRTC-only
        # flags (``include_path``, ``pch_*``, ``frandom_seed``, ...) don't
        # force spurious cache misses on PTX.
        return _linker_option_fingerprint(options, use_driver_linker=self._decide_driver())

    def hash_version_probe(self, update):
        # Only cuLink (driver-backed linker) goes through the CUDA driver
        # for codegen. nvJitLink is a separate library, so a driver
        # upgrade under it does not change the compiled bytes -- skip the
        # driver version there. ``_linker_backend_and_version`` already
Consideration: _decide_driver() is called 2–3 times per key derivation for PTX inputs.
validate() calls self._decide_driver() at line 494, option_fingerprint() calls it again at line 515, and hash_version_probe() calls _linker_backend_and_version() which re-probes independently. Each call re-imports and re-invokes _decide_nvjitlink_or_driver().
If the result were to differ between calls within the same invocation (transient import failure recovering), the key would be internally inconsistent — validation used one backend, fingerprint used another. Even though this is unlikely in practice, computing the result once and threading it through (or caching it on the instance for the duration of one make_program_cache_key call) would be cleaner and more robust.
if target_type not in _VALID_TARGET_TYPES:
    raise ValueError(f"target_type={target_type!r} is not supported (must be one of {sorted(_VALID_TARGET_TYPES)})")
supported_for_code = _SUPPORTED_TARGETS_BY_CODE_TYPE[code_type]
if target_type not in supported_for_code:
Consideration: target_type is not case-normalized, asymmetric with code_type.
code_type is lowered here to ensure "PTX" and "ptx" route the same way. But target_type is not normalized — make_program_cache_key(target_type="PTX") and make_program_cache_key(target_type="ptx") produce different keys for the same compilation. The comment on lines 728–730 says "a caller that passes 'PTX' or 'C++' must get the same routing" but only addresses code_type.
Program.compile may normalize target_type internally, but standalone callers of make_program_cache_key are exposed to this instability. Consider adding target_type = target_type.lower() here to match.
    "debug": _gate_truthy,
    "lineinfo": _gate_truthy,
    "ftz": _gate_tristate_bool,
    "prec_div": _gate_tristate_bool,
    "prec_sqrt": _gate_tristate_bool,
    "fma": _gate_tristate_bool,
    "split_compile": _gate_identity,
    "ptxas_options": _gate_ptxas_options,
    "no_cache": _gate_is_true,
}

# LinkerOptions fields the ``cuLink`` driver backend silently ignores
# (emits only a DeprecationWarning; no actual flag reaches the compiler).
# When the driver backend is active, collapse them to a single sentinel in
# the fingerprint so nvJitLink<->driver parity of ``ObjectCode`` doesn't
Consideration: _LINKER_RELEVANT_FIELDS and _LINKER_FIELD_GATES must be kept in sync manually.
A field added to _LINKER_FIELD_GATES but forgotten in _LINKER_RELEVANT_FIELDS would be silently ignored — the gate exists but is never iterated over in _linker_option_fingerprint. The reverse (missing from _LINKER_FIELD_GATES) would KeyError at runtime, which is fail-fast but late.
A test assertion like assert set(_LINKER_RELEVANT_FIELDS) == set(_LINKER_FIELD_GATES.keys()) would close both directions cheaply. Same applies for sync with _translate_program_options in _program.pyx — the existing test_make_program_cache_key_supported_targets_matches_program_compile guards target drift but not field drift.
def clear(self) -> None:
    # Snapshot stat alongside path so we can refuse to unlink an entry
    # that was concurrently replaced by another process between the
    # snapshot scan and the unlink. Same stat-guard contract as
Nitpick: __len__ double-stats every entry.
_iter_entry_paths already filters with entry.is_file() (line 531), so every yielded path is known to be a file. The path.is_file() check here in __len__ is redundant — it issues a second stat syscall per entry. For a large cache this is 2N unnecessary stat calls. Safe to remove the inner check:
def __len__(self) -> int:
    return sum(1 for _ in self._iter_entry_paths())

self._root.mkdir(parents=True, exist_ok=True)
self._entries.mkdir(exist_ok=True)
Nitpick: max_size_bytes=0 is accepted but makes the cache a black hole.
Both InMemoryProgramCache and FileStreamProgramCache validate max_size_bytes >= 0, so max_size_bytes=0 passes. But a zero-byte cap means every write is immediately evicted (any payload has size > 0). This is almost certainly a user error. Consider either rejecting it with a ValueError, or documenting it explicitly as "effectively disables caching."
    return None
if isinstance(v, str):
    return ("-Xptxas=" + v,)
if isinstance(v, collections.abc.Sequence):
    if len(v) == 0:
        return None
    return tuple(f"-Xptxas={s}" for s in v)
return v

_LINKER_FIELD_GATES = {
    "name": _gate_identity,
    "arch": _gate_identity,
    "max_register_count": _gate_identity,
Nitpick: ptxas_options fingerprint preserves list order.
["-v", "-O2"] and ["-O2", "-v"] produce different cache keys even if ptxas treats them equivalently. This causes spurious cache misses (not collisions), so it's safe but suboptimal. The conservative choice (preserve order) is defensible if ptxas flag order is significant for some flags. Worth a brief comment noting the choice is intentional.
)
from cuda.core._utils.cuda_utils import (
    handle_return as _handle_return,
)
Nitpick: Top-level import creates a fragile circular-import dependency.
_keys.py imports ProgramOptions from cuda.core._program at module level, while _program.pyx does a deferred import of make_program_cache_key inside compile(). The deferred import breaks the cycle today, but if someone later adds from cuda.core.utils import ... at the top of _program.pyx, it would break. Worth a # NOTE: ... comment here documenting the mutual dependency and why _keys.py gets the top-level import while _program.pyx defers.
if extra_digest is not None:
    _update("extra_digest", bytes(extra_digest))

return hasher.digest()
Nitpick: options.name is double-hashed for the linker (PTX) path.
For PTX inputs, "name" is in _LINKER_RELEVANT_FIELDS and goes through _linker_option_fingerprint via _gate_identity. Then this universal block hashes it again under the "options_name" label. Not a collision or instability bug (hashing the same value twice under different labels is safe), but it's redundant work for the linker path. Worth a comment noting the intentional redundancy, or dedup by skipping the linker path here.
leofang
left a comment
Follow-up batch — 6 additional inline comments from the first review round:
- Consideration (3): temp-file burst-write thrashing, O(n) _enforce_size_cap on every write, UTF-8 decode introducing a new failure mode in the cache path.
- Nitpick (3): stat-key triple dedup, Windows sharing-retry dedup, _KeyBackend class hierarchy vs simpler function dispatch.
if self._tmp.exists():
    for tmp in self._tmp.iterdir():
        if not tmp.is_file():
            continue
        try:
            total += tmp.stat().st_size
        except FileNotFoundError:
            continue
if total <= self._max_size_bytes:
Consideration: Temp file accounting under burst writes can cause committed-entry thrashing.
_enforce_size_cap counts temp files toward total but can only evict committed entries (the entries list). During high-concurrency writes, many young temp files (each under the 1h stale-sweep threshold) could push total persistently above the cap with no way to recover — the loop would keep evicting committed entries while the in-flight temps hold the total above max_size_bytes. Once the burst subsides and the temps are committed/cleaned, the evicted entries are gone.
This is a soft-cap design so it's consistent with the documented contract, but worth noting that burst write concurrency can cause disproportionate eviction of committed entries.
    with contextlib.suppress(FileNotFoundError):
        tmp_path.unlink()
    raise
self._enforce_size_cap()
Consideration: _enforce_size_cap is O(n) on every __setitem__.
Every write stats all files in entries/ plus tmp/ to compute the total. For a large cache (thousands of entries), this could be measurably costly on every compile. An incremental size tracker (add on write, subtract on eviction, periodic reconciliation to correct drift from external deletions) would make writes O(1) in the common case, falling back to a full scan only when reconciliation is needed.
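The incremental tracker suggested here might look like the following sketch; `IncrementalSizeTracker` and its method names are hypothetical, and the real version would need to fold in the tmp/ accounting:

```python
from pathlib import Path


class IncrementalSizeTracker:
    """Sketch: O(1) size accounting per write/eviction, with a full
    rescan only to reconcile drift from external deletions."""

    def __init__(self, entries_dir: Path):
        self._entries = entries_dir
        self._total = 0
        self.reconcile()

    def on_write(self, nbytes: int) -> None:
        self._total += nbytes

    def on_evict(self, nbytes: int) -> None:
        self._total = max(0, self._total - nbytes)

    def reconcile(self) -> int:
        # Full scan; run periodically to correct for files removed by
        # other processes or an external wipe.
        self._total = sum(
            p.stat().st_size for p in self._entries.rglob("*") if p.is_file()
        )
        return self._total

    @property
    def total(self) -> int:
        return self._total
```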
# to the original str for the NVRTC/linker paths so the generated
# key matches keys callers build by passing the str source
# directly.
code_for_key = self._code if self._code_type == "nvvm" else self._code.decode("utf-8")
Consideration: UTF-8 decode introduces a new failure mode in the cache path.
If the source code contains non-UTF-8 bytes (e.g. Latin-1 encoded comments), this .decode("utf-8") raises UnicodeDecodeError even though the uncached compile path would succeed (NVRTC accepts raw bytes). The cache path introduces a failure that doesn't exist without cache=.
This is likely rare in practice (CUDA source is almost always ASCII/UTF-8), but worth documenting in the cache= parameter docstring or catching with a clear error message pointing to the limitation.
if (st_now.st_ino, st_now.st_size, st_now.st_mtime_ns) != (
    st_before.st_ino,
    st_before.st_size,
    st_before.st_mtime_ns,
):
    return
Nitpick: Stat-key triple (st_ino, st_size, st_mtime_ns) is repeated in 4 places.
This exact pattern appears in _touch_atime (twice — fd-based and path-based), _prune_if_stat_unchanged, and _enforce_size_cap. Extracting a small helper would reduce duplication and make the invariant easier to audit:
def _stat_key(st: os.stat_result) -> tuple:
    return (st.st_ino, st.st_size, st.st_mtime_ns)

Then each site becomes if _stat_key(st_now) != _stat_key(st_before): return.
for i, delay in enumerate(_REPLACE_RETRY_DELAYS):
    if delay:
        time.sleep(delay)
    try:
        os.replace(tmp_path, target)
        return True
    except PermissionError as exc:
        if not _IS_WINDOWS or getattr(exc, "winerror", None) not in _SHARING_VIOLATION_WINERRORS:
            raise
        # Windows sharing violation; loop and try again unless this was the
        # last attempt, in which case fall through and return False.
        if i == len(_REPLACE_RETRY_DELAYS) - 1:
            return False
return False
Nitpick: Windows sharing-violation retry pattern is repeated across 3 functions.
_replace_with_sharing_retry, _stat_and_read_with_sharing_retry, and _unlink_with_sharing_retry all iterate over _REPLACE_RETRY_DELAYS, sleep, try an operation, catch PermissionError, and check _is_windows_sharing_violation. A generic retry helper parameterized on the operation could deduplicate this:
def _with_sharing_retry(op, *args, on_exhausted=None, **kwargs):
    last_exc = None
    for delay in _REPLACE_RETRY_DELAYS:
        if delay:
            time.sleep(delay)
        try:
            return op(*args, **kwargs)
        except PermissionError as exc:
            if not _is_windows_sharing_violation(exc):
                raise
            last_exc = exc
    if on_exhausted is not None:
        return on_exhausted(last_exc)
    raise last_exc

Not critical — the current code is clear and correct — but it's 3× the same loop structure.
update(f"{label}_probe_failed", type(exc).__name__.encode())

class _KeyBackend(abc.ABC):
Nitpick: _KeyBackend class hierarchy vs plain function dispatch.
Three stateless singleton classes (_NvrtcBackend, _LinkerBackend, _NvvmBackend) for what is essentially a 3-way dispatch on code_type. The ABC does document the contract clearly (6 abstract/overridable methods), but three plain functions (or a dict of named tuples of functions) dispatched from make_program_cache_key would be simpler and equally extensible — there will likely never be a 4th code_type. The class hierarchy adds indirection and cognitive overhead (readers must trace through the ABC to understand the dispatch) without a clear extensibility payoff.
Not a correctness issue — the current design works — just noting the complexity cost relative to a simpler alternative.
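For comparison, the flat dict-of-records alternative could look like this sketch. The names are stand-ins (only two of the six hooks are shown), not the PR's actual backend contract:

```python
from typing import Callable, NamedTuple


class _Backend(NamedTuple):
    # One record of plain functions per code_type instead of an ABC
    # hierarchy; two illustrative hooks out of the six described.
    validate: Callable[[object], None]
    encode_code: Callable[[object], bytes]


def _no_validate(options) -> None:
    pass


_BACKENDS_BY_CODE_TYPE = {
    "c++": _Backend(_no_validate, lambda code: code.encode("utf-8")),
    "ptx": _Backend(_no_validate, lambda code: code.encode("utf-8")),
    "nvvm": _Backend(_no_validate, bytes),
}


def encode_code(code_type: str, code) -> bytes:
    # Same 3-way dispatch, no class indirection.
    return _BACKENDS_BY_CODE_TYPE[code_type.lower()].encode_code(code)
```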
Summary
Adds a persistent on-disk cache for
cuda.core.Program.compileoutputs. The high-level integration is one keyword onProgram.compile:A second invocation with the same inputs short-circuits the entire NVRTC compile —
cache.get(key)(one stat + one read) and anObjectCode._initfrom the bytes. NoProgram_compileis invoked. This is the fast path the cache exists to provide:Public API
Program.compile(target_type, *, cache=...)— convenience wrapper. Derives the key, returns a freshObjectCodeon hit, stores the compile output on miss.cuda.core.utils.ProgramCacheResource— abstract bytes-in / bytes-out interface for custom backends. Providesget,update(Mapping or pairs),clear, and the mapping mutators (__getitem__/__setitem__/__delitem__/__len__).__contains__is intentionally omitted:cache.get(key)is the recommended idiom because the two-callif key in cache: cache[key]pattern is racy across processes.cuda.core.utils.InMemoryProgramCache— single-process LRU onOrderedDict,threading.RLock, size-only cap. For "compile once, look up many" workflows that don't need persistence.cuda.core.utils.FileStreamProgramCache— directory of atomic per-entry files. Safe across processes viaos.replace+ Windows sharing-violation retries onos.replace/ read /unlink.cuda.core.utils.make_program_cache_key— escape hatch when the compile inputs require anextra_digest(include_path,pre_include,pch,use_pch,pch_dir, NVVMuse_libdevice=True, NVRTCoptions.namewith a directory component).Program.compile(cache=...)rejects those compiles with aValueErrorpointing here.On-disk format
Each entry is the raw compiled binary verbatim -- cubin / PTX / LTO-IR -- with no pickle, JSON, length prefix, or framing of any kind. Cache files are directly consumable by external NVIDIA tools (`cuobjdump`, `nvdisasm`, `cuda-gdb`). `ObjectCode.symbol_mapping` from `name_expressions` is not preserved across a cache round-trip; the wrapper rejects `Program.compile(name_expressions=..., cache=...)` outright so the first-call-works / second-call-breaks footgun can't surface. Callers that need `get_kernel(name_expression)` should compile without `cache=`.

### `FileStreamProgramCache`
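The write path detailed in the bullets below follows the standard atomic-publish idiom. A minimal self-contained sketch of that idiom (hypothetical layout and function name, stdlib only -- not the PR's code):

```python
import hashlib
import os
import tempfile
from pathlib import Path

def atomic_store(root: Path, key: bytes, payload: bytes) -> Path:
    """Publish payload under entries/<2char>/<hash> via tmp + fsync + os.replace."""
    digest = hashlib.sha256(key).hexdigest()
    final = root / "entries" / digest[:2] / digest
    final.parent.mkdir(parents=True, exist_ok=True)
    tmp_dir = root / "tmp"
    tmp_dir.mkdir(parents=True, exist_ok=True)   # self-heals if tmp/ was wiped
    fd, tmp_name = tempfile.mkstemp(dir=tmp_dir)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())                 # durable before publish
        os.replace(tmp_name, final)              # atomic rename: readers see
    except BaseException:                        # old bytes or new, never torn
        os.unlink(tmp_name)
        raise
    return final
```

Because `os.replace` swaps the directory entry atomically, a concurrent reader holding the old file keeps reading complete old bytes; the Windows retry logic described below exists because `os.replace` can transiently fail while such a reader holds the destination open.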
- Writes are atomic: `tmp/`, fsync, `os.replace` into `entries/<2char>/<hash>`. Concurrent readers never observe partial writes. On Windows, `os.replace` retries on `ERROR_ACCESS_DENIED` / `ERROR_SHARING_VIOLATION` / `ERROR_LOCK_VIOLATION` (winerrors 5/32/33) within a bounded backoff (~185 ms); after the budget, the write is dropped and the next call recompiles. The same retry covers reads and `path.unlink` so eviction doesn't crash the writer that triggered it on win-64. `_is_windows_sharing_violation(exc)` filters `EACCES` only when `winerror` is absent -- non-sharing winerrors are real config errors and propagate. Off-Windows, `PermissionError` always propagates.
- `cache[key] = value` (and `cache.update({key: value, ...})`) accept raw bytes, bytearray, memoryview, or any `ObjectCode` (path-backed too -- the file is read at write time so the cached entry is the binary content, not a path that could move). Reads return the same bytes that went in.
- `max_size_bytes` is the only knob -- no element-count cap. `None` means unbounded.
- Reads bump `st_atime` via `os.utime` (fd-based on Linux/macOS via `os.supports_fd`, path-based on Windows) regardless of mount options or `NtfsDisableLastAccessUpdate`. Eviction sorts by oldest `st_atime` first. The atime touch is stat-guarded so a racing rewriter's freshly-replaced file never has its mtime rolled back.
- `clear()`, `_enforce_size_cap()`, and the atime touch all snapshot `(ino, size, mtime_ns)` per entry and refuse to unlink / overwrite stamps if a writer replaced the file mid-operation.
- Key derivation (`make_program_cache_key`): a backend-strategy pattern with one class per `code_type` (`_NvrtcBackend` / `_LinkerBackend` / `_NvvmBackend`). Each owns its own validate / encode_code / option_fingerprint / encode_name_expressions / hash_version_probe / hash_extra_payload. The orchestrator validates `code_type` / `target_type`, dispatches to the right backend, and assembles the digest in fixed order. Adding a new backend is one new class, not a five-place edit.
- `options.name` with a directory component: rejected without `extra_digest` because NVRTC resolves quoted `#include` directives relative to that directory -- neighbour-header changes wouldn't invalidate the cache otherwise.
- A PTX cache hit re-emits the same loadability `RuntimeWarning` the uncached path emits -- loadability depends on the driver, not on whether the bytes were freshly compiled.
- If `path` is omitted, the cache directory resolves via `platformdirs.user_cache_path("cuda-python", appauthor=False, opinion=False) / "program-cache"`:
  - Linux: `$XDG_CACHE_HOME/cuda-python/program-cache` (default `~/.cache/cuda-python/program-cache`)
  - macOS: `~/Library/Caches/cuda-python/program-cache`
  - Windows: `%LOCALAPPDATA%\cuda-python\program-cache`
- `tmp/` self-heal: if something deletes `tmp/` after the cache is opened, the next write recreates it rather than crashing with `FileNotFoundError`.

## Test plan
- `tests/test_program_cache.py` -- abstract-class contract; `update` accepts mapping or pairs; transparent input-form equivalence (bytes / bytearray / memoryview / bytes-backed `ObjectCode` / path-backed `ObjectCode` all round-trip to the same on-disk bytes); `make_program_cache_key` semantics (deterministic, supported-target matrix mirrors `Program.compile`, backend probe failures fail closed but stable, env-version changes don't perturb the key on the wrong backends, options-fingerprint canonicalization for the linker path, side-effect / external-content / NVRTC `options.name`-dir-component guards, schema version mixing); filestream CRUD; atomic-write race coverage; stat-guarded prune / atime-touch / clear / size-cap; atime LRU promotes recently-read; default dir uses `platformdirs`; `_is_windows_sharing_violation` predicate's truth table including the regression case (non-sharing winerror plus EACCES propagates); `tmp/` recreation after external wipe.
- `tests/test_program_cache_multiprocess.py` -- concurrent writers same key, distinct keys, reader-vs-writer torn-file safety, size-cap eviction race (rewriter vs. churner) under stat-guarded eviction.
- `tests/test_program_compile_cache.py` -- `Program.compile(cache=...)` miss/hit/error paths against a recording stub; `name_expressions` rejection; `extra_digest`-required / side-effect / NVRTC `options.name`-dir-component rejection; PTX loadability warning on cache hit (positive + negative); real-NVRTC end-to-end roundtrip across reopen.